Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add multiple GPU support #760

Conversation

larsbijl
Copy link
Contributor

Add support to manage multiple GPU's similar to CPU's

  • Rename gpu to gpu memory across the board to make way for gpu min and
    max values.
  • Rename mem to memory to more descriptive.
  • Make gpu memory a proper value in the host reports not and additional
    attribute.
  • Add setting and updating GPU and GPU memory counts via the API.
  • GPU list is given to frames in RQD using the CUE_GPU_CORES env
    variable.

Missing from the MR.

  1. for simplicity I modified the Initial migration to incorporate all the changed needed in both the tables, functions and triggers.
    To keep backward compatibility for users it will need to make it into a migration.

  2. Our cuegui and rqd have diverged too much for easy merge. I've ported what I can, but it will likely be missing elements.

  3. We don't use windows, the GPU RQD side uses nvidia-smi directly. we will want to find an OS-agnostic method.

  4. tests. we will definitely need to write some tests.

Closes #459 #460

Add support to manage multiple GPU's similar to CPU's

- Rename gpu to gpu memory across the board to make way for gpu min and
max values.
- Rename mem to memory to more descriptive.
- Make gpu memory a proper value in the host reports not and additional
attribute.
- Add setting and updating GPU and GPU memory counts via the API.
- GPU list is given to frames in RQD using the CUE_GPU_CORES env
variable.

Missing from the MR.

1) for simplicity I modified the Initial migration to incorporate all the changed needed in both the tables, functions and triggers.
To keep backward compatibility for users it will need to make it into a migration.

2) Our cuegui and rqd have diverged too much for easy merge. I've ported what I can, but it will likely be missing elements.

3) We don't use windows, the GPU RQD side uses nvidia-smi directly. we will want to find an OS-agnostic method.

4) tests. we will definitely need to write some tests.
@larsbijl larsbijl force-pushed the 459-mutliple-gpu-support branch from babeb19 to cc2585b Compare August 16, 2020 14:02
Copy link
Collaborator

@bcipriano bcipriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lars, first of all sorry for leaving this for so long.

I've reviewed it, and overall it looks good. I have some minor comments here, and there are some conflicts that need to be resolved, but if we can clear these up I think we can wrap this one up quickly.

@@ -17,6 +17,7 @@ CREATE TABLE frame_history (
int_mem_reserved BIGINT DEFAULT 0 NOT NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will definitely want to do this as a new migration instead of updating an existing migration.

That will also necessitate a minor version bump.

@@ -29,15 +29,15 @@
cores NMTOKEN #REQUIRED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this should become a new 1.10 version instead of modifying the existing version.

@splhack
Copy link
Contributor

splhack commented Feb 19, 2021

@larsbijl @bcipriano Do you mind if I help with this pull request? We'd like to use similar 8GPUs machines.
I'll keep @larsbijl with Co-authored-by 😉
I may split it into multiple PRs, like DB migrations + DTD + version bump, others, etc.

@bcipriano
Copy link
Collaborator

No objections here! We just discussed this at our last TSC meeting as a major item that would be great to wrap up ASAP.

@larsbijl
Copy link
Contributor Author

By all means! please go for it :)

splhack added a commit to splhack/OpenCue that referenced this pull request Feb 19, 2021
splhack added a commit to splhack/OpenCue that referenced this pull request Feb 19, 2021
splhack added a commit to splhack/OpenCue that referenced this pull request Feb 20, 2021
@splhack
Copy link
Contributor

splhack commented Feb 22, 2021

@larsbijl @bcipriano #924 is ready for code review. I may force push it again if I find something though.
I will add unittests for cuebot next.

bcipriano pushed a commit that referenced this pull request Jun 20, 2021
@splhack
Copy link
Contributor

splhack commented Jul 27, 2021

#924 is merged, we can close this 🙂

@bcipriano bcipriano closed this Jul 29, 2021
ramonfigueiredo added a commit to ramonfigueiredo/OpenCue that referenced this pull request Jul 24, 2024
…columns

- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add multiple GPU support AcademySoftwareFoundation#760 (AcademySoftwareFoundation#924)" on 4/18/22 at 11:45 AM where the following new columns were introduced on the CueJobMonitorTree: "Gpus", "Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were wrongly defined.
ramonfigueiredo added a commit that referenced this pull request Jul 26, 2024
Fix "Monitor Cue" with incorrect column indexing for "Min" and "Max"
columns
- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add
multiple GPU support #760 (#924)" on 4/18/22 at 11:45 AM where the
following new columns were introduced on the CueJobMonitorTree: "Gpus",
"Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were
wrongly defined.
n-jay pushed a commit to n-jay/OpenCue that referenced this pull request Jul 26, 2024
…dation#1431)

Fix "Monitor Cue" with incorrect column indexing for "Min" and "Max"
columns
- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add
multiple GPU support AcademySoftwareFoundation#760 (AcademySoftwareFoundation#924)" on 4/18/22 at 11:45 AM where the
following new columns were introduced on the CueJobMonitorTree: "Gpus",
"Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were
wrongly defined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support GPU similar to CPU
3 participants